-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Design document for RemoteRepository DB normalization #7169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design document for RemoteRepository DB normalization #7169
Conversation
Initial document to discuss the path forward to implement #2759
This makes sense to me. I wonder if it's worth considering only storing the data we actually care about, and creating model fields for it, instead of the entire JSON blob? I guess we've hit times where we wanted extra data that we didn't store previously, but we could always re-query the API to get it, if needed. That would also cut down a good bit on the amount of data storage required. |
That's a good question. One thing I'd mention is that having the I think it's a trade off between adding more work from our side (add these fields, populate them and maintain this code) vs. db storage. There is a lot of API URLs in the response that most of them are useless for our case. The rest of the data looks useful, though. Maybe keep using a JSON field but removing these URLs before storing it? I have a slightly preference to keep the whole json as is and rely on Postgres JSON internals, because seems simpler and safer. |
Also, something to note is that the |
Yea, I think at least trimming known useless data would be good. I mostly just want to get the data size reduced. The other issue is, what is the API response format changes? It seems like that could wreck havoc on our side if we aren't parsing the incoming content to get the values we want. |
I replied too quickly last time. We are already parsing the relevant data and storing it as So, we may not require the
If the API response change, even if we are parsing the relevant fields, it may break. We do not have have a good way to protect ourselves here. To avoid this, we should pin to an API version on each service (e.g. https://developer.github.com/v3/media/#request-specific-version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some things have changed since I wrote this document. Besides, SSO is already implemented and working with the current modeling: no JSON is required at all. We could completely remove this field if strongly required or think more about it.
I added some comments to keep in mind when implementing this design.
|
||
* Keep ``RemoteRepository`` in sync with GitHub repositories. | ||
* Delete ``RemoteRepository`` objects deleted from GitHub. | ||
* Listen to GitHub events to detect ``full_name`` changes and update our objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add a remote_id=
field, this sync is not even needed.
* Save only one ``RemoteRepository`` per GitHub repository. | ||
* Use an intermediate table between ``RemoteRepository`` and ``User`` to store associated remote data for the specific user. | ||
* Make this model usable from our SSO implementation. | ||
* Use Post ``JSONField`` to store associated ``json`` remote data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be considered a "Nice to have" in case we don't have the time to make all the changes required to depend on PostgreSQL in tests. If we keep the admin=
field in the relationship, filtering over the JSON is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can probably wait here until we migrate to Django 3.x that implements this natively and avoid making this migration twice. We don't really need json field right now.
@ericholscher I updated the document with the latest changes in context. This is ready to merge IMO, and we can workaround some implementation details in the PR that implements this. I'm tempt to say that we can remove the |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This is required. It's ready for review and merge, IMO. The implementation will take some time, tho. |
…humitos/design-remote-repository-normalization
I'm merging this since we are already implementing this and this document reflects almost exactly what we are doing. I just added a "Rollout plan" section with more info about this. |
…humitos/design-remote-repository-normalization
Initial document to discuss the path forward to implement
Reference: #2759